Introduction
Welcome to my final STAT 228 project! This time, I’ll be analyzing my own listening history data from Spotify. One of my favorite aspects of Spotify is its ability to make song suggestions and generate playlists based on your listening history. I love the playlists it generates for me and I look forward to Spotify Wrapped every year!
In this project, I’ll spend some time looking at my listening history itself. Then, in keeping with Spotify’s most iconic feature, I’ll build a model and predict some aspects of my listening history!
For this project, I’ll be using the spotifyr package, which was featured in TidyTuesday. To access my own data, I followed an incredibly helpful tutorial from Charlie Thompson, who also contributed to the spotifyr package.
Let’s get started!
Setting up Spotify
First, I had to set up a Dev account with Spotify to access their Web API, giving me access to the IDs needed to pull my access token and, in doing so, my Spotify data. More detailed instructions for this are on Charlie Thompson’s tutorial. Once the access token is pulled, every subsequent Spotify function will refer to it; no need to call it every time.
The code should look like this:
Sys.setenv(SPOTIFY_CLIENT_ID = 'xxxxxxxxxxxxxxxxxxxxx')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxx')
access_token <- get_spotify_access_token()
Taking a peek at my data
Now that my Spotify is all set up, let’s see who my favorite artists are.
get_my_top_artists_or_tracks(type = 'artists',
time_range = 'long_term', limit = 5) %>%
select(name, genres) %>%
rowwise %>%
mutate(genres = paste(genres, collapse = ', ')) %>%
ungroup
## # A tibble: 5 × 2
## name genres
## <chr> <chr>
## 1 David Bowie art rock, classic rock, glam rock, permanent wave, rock
## 2 Amy Winehouse british soul, indie r&b, neo soul
## 3 Gorillaz alternative hip hop
## 4 Ariana Grande dance pop, pop
## 5 Talking Heads art punk, art rock, dance rock, funk rock, new wave, permanent …
Ok, this is pretty cool. Let’s look at my top songs now!
get_my_top_artists_or_tracks(type = 'tracks', time_range = 'long_term', limit = 5) %>%
mutate(artist.name = map_chr(artists, function(x) x$name[1])) %>%
select(name, artist.name, album.name)
## name artist.name
## 1 Amy Amy Amy Amy Winehouse
## 2 Doing Yoga Kazy Lambist
## 3 Toi Et Moi Paradis
## 4 Station to Station - 2016 Remaster David Bowie
## 5 Fed Up Stephen Marley
## album.name
## 1 Frank
## 2 On You (Radio Edit) - Single
## 3 Recto Verso
## 4 Station to Station (2016 Remaster)
## 5 Mind Control
As much as I love all five of these songs, I’m a little surprised by them. Well, the data doesn’t lie! Since Amy is one of my top artists and is at the very top of my most-played songs, let’s start with her.
Comparing my favorite artists
Amy Winehouse
First, I’ll create the winehouse dataset that includes all information that Spotify has about her music. Next, let’s look at valence, a Spotify statistic for measuring the joy of a song; higher valence = more joyful.
winehouse <- get_artist_audio_features('Amy Winehouse')
winehouse %>%
arrange(-valence) %>%
select(track_name, valence) %>%
head(5)
## track_name valence
## 1 Monkey Man - Live On Jools Holland Hootenanny / 2006 0.926
## 2 Monkey Man 0.923
## 3 You're Wondering Now 0.917
## 4 Tenderly - Live On Later... With Jools Holland / 2006 0.886
## 5 Stronger Than Me 0.855
Now, let’s take a look at valence for each of her albums.
winehouse %>%
filter(album_name %in% c("Frank", "Back To Black", "Lioness: Hidden Treasures", "AMY (Original Motion Picture Soundtrack)", "At The BBC")) %>%
ggplot(aes(x = valence, y = fct_reorder(album_name, album_release_year))) +
geom_density_ridges(color = "palevioletred3", fill = "palevioletred2") +
labs(title = "Joyplot of Amy Winehouse's joy distributions", subtitle = "Based on each album's valence",
x = "Valence (joy)", y = "") +
theme_minimal()

This is interesting: valence is similar for most albums, but not for AMY, the motion picture soundtrack. Having seen the movie, this makes a lot of sense. Amy Winehouse died in July 2011, a few months before Lioness: Hidden Treasures was released. AMY is a documentary about her life, success, and tragic downfall. It makes sense that the music selected for the film would, therefore, be a lot less joyful.
David Bowie
Valence
David Bowie, my top-listened-to artist for years, can’t escape my data analysis! Let’s do the same thing, comparing the valence of his studio albums.
bowie <- get_artist_audio_features('david bowie')
bowie %>%
ggplot(aes(x = valence, y = fct_reorder(album_name, album_release_year))) +
geom_density_ridges() +
labs(title = "Joyplot of David Bowie's joy distributions", subtitle = "Based on each album's valence",
x = "Valence (joy)", y = "") +
theme_minimal()

Whoa… that’s a lot of albums. To be fair, David Bowie was an extremely prolific artist; Amy Winehouse has only 5 albums on her Spotify, but she also died very young at 27. David Bowie’s career spanned 6 decades, not including posthumous releases.
Let’s modify this graph slightly to show decade rather than album name. The album names themselves don’t add much to this visualization. I think it would be more interesting to simplify and emphasize the change in valence over time.
bowie %>%
mutate(decade = floor(album_release_year/10) * 10) %>%
# Plot one ridge per decade; the labels are built from the decade values themselves
ggplot(aes(x = valence, y = factor(decade))) +
geom_density_ridges(aes(fill = decade, color = decade)) +
labs(title = "Joyplot of David Bowie's joy distributions", subtitle = "Based on each decade's valence",
x = "Valence (joy)", y = "") +
scale_y_discrete(labels = function(x) paste0(x, "s")) +
theme_minimal() +
theme(legend.position = "none")

Now this looks much better! Because of the sheer number of albums David Bowie has produced, plotting valence by decade makes a lot of sense. From the visualization above, it looks like the valence (joyfulness) of his music stayed fairly constant over his career. It started higher and slowly dipped, looking lowest in the 80s before coming back up a little. That’s surprising since the 80s were full of upbeat, danceable music, including Bowie’s Let’s Dance album from 1983.
Winehouse VS Bowie
Before we move on to modeling, it’s probably worthwhile to compare the two musicians we’ve looked at so far. Let’s compare their overall valence.
ggplot() +
geom_density(data = bowie, aes(x = valence), color = "slateblue", fill = "slateblue", alpha = 0.5) +
geom_density(data = winehouse, aes(x = valence), color = "royalblue", fill = "royalblue", alpha = 0.5) +
labs(title = "Amy Winehouse VS David Bowie's Music Valence", x = "Valence (joy)", y = "") +
annotate("text", x = 0.68, y = 1.95, size = 4,
label = "Amy Winehouse", color = "royalblue") +
annotate("text", x = 0.39, y = 1.75, size = 4,
label = "David Bowie", color = "slateblue") +
theme_minimal() +
theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())

Amy Winehouse has two distinct peaks while David Bowie has one. I’m guessing that AMY contributed to the height of that lower peak. Regardless, Amy Winehouse’s music seems to be a little more joyful than David Bowie’s.
Modeling
So far, we’ve looked a lot at valence, Spotify’s measure of a song’s joyfulness. The focus has primarily been on Amy Winehouse and David Bowie, my two top listened-to artists.
I wonder if we can build a model that predicts whether a song belongs to David Bowie or Amy Winehouse based on its Spotify statistics: valence, danceability, speechiness, and so on?
Creating our new dataset
First, let’s get the dataset we’ll be working with by using rbind() to add the bowie data to the winehouse data. We also have to factor artist_name in order for the classification models to work correctly.
bwjoined = rbind(winehouse, bowie)
bwjoined = bwjoined %>%
mutate(artist_name = factor(artist_name))
Now that we have both of their music data in one dataset, let’s start making our models!
Decision Tree
We’re going to build a decision tree by using Spotify’s statistics to predict the artist. The features we’ll be using are:
danceability
energy
loudness
speechiness
acousticness
instrumentalness
liveness
tempo
and, of course, valence!
Now it’s time to create the model. First, we set up by splitting the data:
# 1 Data splitting
set.seed(18) # it's a good number!
bw_split = initial_split(bwjoined, prop = 0.7)
bw_train = training(bw_split)
bw_test = testing(bw_split)
Now, we fit a classification decision tree to the training data…
# 2 Fit the model
bw_tree =
decision_tree() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)
…and generate predictions…
# 3 Generate predictions
bw_pred = bw_tree %>%
predict(new_data = bw_test)
…and finally, generate prediction metrics.
# 4 Prediction metrics
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.912
The accuracy of this model is 0.91, which is very high. Because the accuracy is so high, it’s worth looking at a confusion matrix to see how the predictions are laid out.
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Amy Winehouse David Bowie
## Amy Winehouse 57 14
## David Bowie 29 389
According to this confusion matrix, David Bowie is predicted correctly most of the time (389 of his 403 test songs, about 97%), but Amy Winehouse has a lower accuracy (57 of her 86, about 66%). It’s not extreme as she’s still predicted correctly the majority of the time, but it’s noticeable. This might be because David Bowie has many more songs than Amy Winehouse, so the model is more inclined to guess Bowie from the start. Here, about 82% of the test set is Bowie’s music, so even a model that guessed David Bowie every single time would still have about 82% accuracy.
Let’s see if any of the other models can improve on this by learning from their mistakes (literally!).
Bagged Decision Tree
A bagged decision tree is a model that builds many trees on different bootstrap samples of the training data, which hopefully lowers the model’s variance and keeps bias low.
# 1 Data splitting
set.seed(18) # it's a good number!
bw_split = initial_split(bwjoined, prop = 0.7)
bw_train = training(bw_split)
bw_test = testing(bw_split)
# 2 Fit the model
bw_bag_tree =
bag_tree() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)
# 3 Generate predictions
bw_pred = bw_bag_tree %>%
predict(new_data = bw_test)
# 4 Prediction metrics
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.933
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Amy Winehouse David Bowie
## Amy Winehouse 58 5
## David Bowie 28 398
In this bagged decision tree, the accuracy is now 0.93, but not much has changed. If anything, accuracy for David Bowie is higher (398 of 403), but Amy Winehouse’s barely moves (58 of 86).
Let’s try a random forest next. A random forest is similar to a bagged decision tree, but it also considers only a random subset of features at each split in the tree.
Random Forest
# 1 Data splitting
set.seed(18) # it's a good number!
bw_split = initial_split(bwjoined, prop = 0.7)
bw_train = training(bw_split)
bw_test = testing(bw_split)
# 2 Fit the model
bw_rf =
rand_forest() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)
# 3 Generate predictions
bw_pred = bw_rf %>%
predict(new_data = bw_test)
# 4 Prediction metrics
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.933
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Amy Winehouse David Bowie
## Amy Winehouse 56 3
## David Bowie 30 400
The random forest’s results are very similar to the bagged tree’s: the accuracy is again 0.93, with little change for either artist.
Gradient Boosting Model (GBM)
Gradient Boosting Models (GBMs) have become very popular. They begin with a very simple tree, and each new tree learns from the mistakes of the trees before it in the ensemble. They have options that we can tweak, but we’ll keep it simple.
# 1 Data splitting
set.seed(18) # it's a good number!
bw_split = initial_split(bwjoined, prop = 0.7)
bw_train = training(bw_split)
bw_test = testing(bw_split)
# 2 Fit the model
bw_boost =
boost_tree() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)
# 3 Generate predictions
bw_pred = bw_boost %>%
predict(new_data = bw_test)
# 4 Prediction metrics
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.937
bw_test %>%
mutate(artist_pred = bw_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Amy Winehouse David Bowie
## Amy Winehouse 57 2
## David Bowie 29 401
This model has the highest accuracy at 0.94. It is fantastic at predicting David Bowie’s music (401 of 403), but not much has changed for Amy Winehouse’s music (57 of 86). Again, I attribute part of this to the difference in sample sizes, but her music could just be more difficult to predict.
Bonus: Adding More Artists!
Just for fun, let’s use a GBM again to predict which artist is which… but try it across all five of my top artists!
First, to set up the dataset:
# Get the data from my other three top artists
gorillaz <- get_artist_audio_features('Gorillaz')
grande <- get_artist_audio_features('Ariana Grande')
theads <- get_artist_audio_features('Talking Heads')
# Combine all of the artists' data
all_artists = rbind(winehouse, bowie, gorillaz, grande, theads)
# Factor artist_name to use it for our GBM!
all_artists = all_artists %>%
mutate(artist_name = factor(artist_name))
Now that the data is all set, we can fit our GBM!
# 1 Data splitting
set.seed(18) # it's a good number!
all_split = initial_split(all_artists, prop = 0.7)
all_train = training(all_split)
all_test = testing(all_split)
# 2 Fit the model
all_boost =
boost_tree() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = all_train)
# 3 Generate predictions
all_pred = all_boost %>%
predict(new_data = all_test)
# 4 Prediction metrics
all_test %>%
mutate(artist_pred = all_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.839
all_test %>%
mutate(artist_pred = all_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Amy Winehouse Ariana Grande David Bowie Gorillaz Talking Heads
## Amy Winehouse 56 2 5 3 0
## Ariana Grande 6 126 7 1 2
## David Bowie 18 1 372 23 27
## Gorillaz 3 2 7 55 1
## Talking Heads 3 0 11 9 72
The accuracy is 0.84. That’s not all that bad, actually! In this case, since we have so many more targets (artists), the confusion matrix suits its name: it’s pretty confusing. Still, I can see that Ariana Grande and David Bowie have the highest accuracy. For one more test, I wonder how a model would do if it just had to look at those two musicians?
Bonus II: I Can’t Stop Modeling!
Let’s just do this all in one go:
# Combine Bowie's and Grande's data
bowie_grande = rbind(bowie, grande)
# Factor artist_name to use it for our GBM!
bowie_grande = bowie_grande %>%
mutate(artist_name = factor(artist_name))
# 1 Data splitting
set.seed(18) # it's a good number!
bg_split = initial_split(bowie_grande, prop = 0.7)
bg_train = training(bg_split)
bg_test = testing(bg_split)
# 2 Fit the model
bg_boost =
boost_tree() %>%
set_mode(mode = "classification") %>%
fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bg_train)
# 3 Generate predictions
bg_pred = bg_boost %>%
predict(new_data = bg_test)
# 4 Prediction metrics
bg_test %>%
mutate(artist_pred = bg_pred$.pred_class) %>%
accuracy(estimate = artist_pred, truth = artist_name)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.983
bg_test %>%
mutate(artist_pred = bg_pred$.pred_class) %>%
conf_mat(estimate = artist_pred, truth = artist_name)
## Truth
## Prediction Ariana Grande David Bowie
## Ariana Grande 123 3
## David Bowie 6 407
0.98 accuracy!!! That’s darn good, especially since the confusion matrix isn’t sketchy about it!
---
title: "We Could Be Data Scientists, Just For One Day..."
author: "Kiki Regan"
date: "May 12, 2022"
output: 
  html_document:
    theme: flatly
    toc: true
    toc_float: true
    code_download: true
---

```{r, warning = FALSE, message = FALSE}
library(tidyverse)   # dplyr, ggplot2, purrr, forcats
library(tidymodels)  # rsample, parsnip, yardstick
library(baguette)    # bag_tree()
library(spotifyr)    # Spotify Web API wrapper
library(ggridges)    # geom_density_ridges()
```

# Introduction

Welcome to my final STAT 228 project! This time, I'll be analyzing *my own* listening history data from Spotify. One of my favorite aspects of Spotify is its ability to make song suggestions and generate playlists based on your listening history. I love the playlists it generates for me and I look forward to Spotify Wrapped every year!

In this project, I'll spend some time looking at my listening history itself. Then, in keeping with Spotify's most iconic feature, I'll build a model and **predict** some aspects of my listening history!

For this project, I'll be using the `spotifyr` package, which was featured in [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md). To access my own data, I followed an incredibly helpful tutorial from [Charlie Thompson](https://www.rcharlie.com/spotifyr/), who also contributed to the `spotifyr` package.

Let's get started!

## Setting up Spotify

![*My face when I found out you could access your own Spotify data*](http://www.reactiongifs.com/wp-content/uploads/2014/01/Shocked-David-Bowie.gif)

First, I had to set up a Dev account with Spotify to access their Web API, giving me access to the IDs needed to pull my access token and, in doing so, my Spotify data. More detailed instructions for this are on [Charlie Thompson's tutorial](https://www.rcharlie.com/spotifyr/). Once the access token is pulled, every subsequent Spotify function will refer to it; no need to call it every time.

The code should look like this:

```{r, echo = FALSE, warning = FALSE, message = FALSE}
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

#View(spotify_songs)
```
```{r, echo = FALSE, warning = FALSE, message = FALSE}
# SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are read from the environment
# (for example ~/.Renviron), so the real values stay out of this file.

access_token <- get_spotify_access_token()
```

```{r, warning = FALSE, message = FALSE, eval = FALSE}
Sys.setenv(SPOTIFY_CLIENT_ID = 'xxxxxxxxxxxxxxxxxxxxx')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxx')

access_token <- get_spotify_access_token()
```
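
One way to keep real credentials out of the document itself is to store them outside it, for example in `~/.Renviron`, and only check that they are present before calling the API. The sketch below assumes that setup; `has_spotify_credentials()` is a hypothetical helper written for illustration, not part of `spotifyr`.

```r
# Hypothetical helper (not part of spotifyr): TRUE only if both
# environment variables are set to non-empty values.
has_spotify_credentials <- function() {
  id     <- Sys.getenv("SPOTIFY_CLIENT_ID")
  secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
  nzchar(id) && nzchar(secret)
}

if (has_spotify_credentials()) {
  message("Spotify credentials found.")
} else {
  message("Set SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET first.")
}
```

With the variables set this way, `get_spotify_access_token()` can be called without arguments, since it looks up both values from the environment by default.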

## Taking a peek at my data

Now that my Spotify is all set up, let's see who my favorite artists are. 

```{r, warning = FALSE, message = FALSE}
get_my_top_artists_or_tracks(type = 'artists', 
                             time_range = 'long_term', limit = 5) %>% 
    select(name, genres) %>% 
    rowwise %>% 
    mutate(genres = paste(genres, collapse = ', ')) %>% 
    ungroup 
```

Ok, this is pretty cool. Let's look at my top songs now!

```{r, warning = FALSE, message = FALSE}
get_my_top_artists_or_tracks(type = 'tracks', time_range = 'long_term', limit = 5) %>% 
    mutate(artist.name = map_chr(artists, function(x) x$name[1])) %>% 
    select(name, artist.name, album.name)
```

As much as I love all five of these songs, I'm a little surprised by them. Well, the data doesn't lie! Since Amy is one of my top artists and is at the very top of my most-played songs, let's start with her.

# Comparing my favorite artists

## Amy Winehouse

First, I'll create the `winehouse` dataset that includes all information that Spotify has about her music. Next, let's look at **valence**, a Spotify statistic for measuring the **joy** of a song; higher valence = more joyful.

```{r, warning = FALSE, message = FALSE}
winehouse <- get_artist_audio_features('Amy Winehouse')
winehouse %>% 
    arrange(-valence) %>% 
    select(track_name, valence) %>% 
    head(5) 
```

Now, let's take a look at valence for each of her albums.

```{r, warning = FALSE, message = FALSE}
winehouse %>% 
  filter(album_name %in% c("Frank", "Back To Black", "Lioness: Hidden Treasures", "AMY (Original Motion Picture Soundtrack)", "At The BBC")) %>% 
ggplot(aes(x = valence, y = fct_reorder(album_name, album_release_year))) + 
    geom_density_ridges(color = "palevioletred3", fill = "palevioletred2") + 
    labs(title = "Joyplot of Amy Winehouse's joy distributions", subtitle = "Based on each album's valence", 
            x = "Valence (joy)", y = "") +
  theme_minimal()
```

This is interesting: valence is similar for most albums, but not for *AMY*, the motion picture soundtrack. Having seen the movie, this makes a lot of sense. Amy Winehouse died in July 2011, a few months before *Lioness: Hidden Treasures* was released. *AMY* is a documentary about her life, success, and tragic downfall. It makes sense that the music selected for the film would, therefore, be a lot less joyful.

## David Bowie

### Valence

David Bowie, my top-listened-to artist for years, can't escape my data analysis! Let's do the same thing, comparing the valence of his studio albums. 

```{r, warning = FALSE, message = FALSE}
bowie <- get_artist_audio_features('david bowie')

bowie %>% 
  ggplot(aes(x = valence, y = fct_reorder(album_name, album_release_year))) + 
    geom_density_ridges() + 
    labs(title = "Joyplot of David Bowie's joy distributions", subtitle = "Based on each album's valence", 
            x = "Valence (joy)", y = "") +
  theme_minimal()
```

Whoa... that's a *lot* of albums. To be fair, David Bowie was an *extremely* prolific artist; Amy Winehouse has only 5 albums on her Spotify, but she also died very young at 27. David Bowie's career spanned 6 decades, not including posthumous releases.

Let's modify this graph *slightly* to show decade rather than album name. The album names themselves don't add much to this visualization. I think it would be more interesting to simplify and emphasize the change in valence over time.

```{r, warning = FALSE, message = FALSE}
bowie %>% 
  mutate(decade = floor(album_release_year/10) * 10) %>% 
  # Plot one ridge per decade; the labels are built from the decade values themselves
  ggplot(aes(x = valence, y = factor(decade))) + 
    geom_density_ridges(aes(fill = decade, color = decade)) + 
    labs(title = "Joyplot of David Bowie's joy distributions", subtitle = "Based on each decade's valence", 
            x = "Valence (joy)", y = "") +
  scale_y_discrete(labels = function(x) paste0(x, "s")) +
  theme_minimal() +
  theme(legend.position = "none")
```
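
The decade bucketing in the chunk above can be checked on a handful of years. This is a small sketch: `to_decade()` is a hypothetical helper name, and the years are just illustrative inputs rather than values pulled from the `bowie` data.

```r
# Divide the release year by 10, drop the remainder, then scale back up,
# so every year maps to the first year of its decade.
to_decade <- function(year) floor(year / 10) * 10

years   <- c(1969, 1977, 1983, 2016)
decades <- to_decade(years)
labels  <- paste0(decades, "s")

print(data.frame(year = years, decade = decades, label = labels))
```

Building the axis labels from the decade values themselves means they stay correct even if Spotify's catalogue for an artist gains or loses a decade.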

Now this looks much better! Because of the sheer number of albums David Bowie has produced, plotting valence by decade makes a lot of sense. From the visualization above, it looks like the valence (joyfulness) of his music stayed fairly constant over his career. It started higher and slowly dipped, looking lowest in the 80s before coming back up a little. That's surprising since the 80s were full of upbeat, danceable music, including Bowie's *Let's Dance* album from 1983.

# Winehouse VS Bowie

Before we move on to modeling, it's probably worthwhile to compare the two musicians we've looked at so far. Let's compare their overall **valence**.

```{r, warning = FALSE, message = FALSE}
ggplot() + 
  geom_density(data = bowie, aes(x = valence), color = "slateblue", fill = "slateblue", alpha = 0.5) +
  geom_density(data = winehouse, aes(x = valence), color = "royalblue", fill = "royalblue", alpha = 0.5) +
  labs(title = "Amy Winehouse VS David Bowie's Music Valence", x = "Valence (joy)", y = "") +
  annotate("text", x = 0.68, y = 1.95, size = 4,
           label = "Amy Winehouse", color = "royalblue") +
  annotate("text", x = 0.39, y = 1.75, size = 4,
           label = "David Bowie", color = "slateblue") +
  theme_minimal() +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
```

Amy Winehouse has two distinct peaks while David Bowie has one. I'm guessing that *AMY* contributed to the height of that lower peak. Regardless, Amy Winehouse's music seems to be a little more joyful than David Bowie's.

# Modeling

So far, we've looked a lot at valence, Spotify's measure of a song's joyfulness. The focus has primarily been on Amy Winehouse and David Bowie, my two top listened-to artists. 

I wonder if we can build a model that predicts whether a song belongs to David Bowie or Amy Winehouse based on its Spotify statistics: **valence, danceability, speechiness,** and so on?

## Creating our new dataset

First, let's get the dataset we'll be working with by using `rbind()` to add the `bowie` data to the `winehouse` data. We also have to factor `artist_name` in order for the classification models to work correctly.

```{r, warning = FALSE, message = FALSE}
bwjoined = rbind(winehouse, bowie)

bwjoined = bwjoined %>% 
  mutate(artist_name = factor(artist_name))
```

Now that we have both of their music data in one dataset, let's start making our models!

## Decision Tree

We're going to build a **decision tree** by using Spotify's statistics to predict the artist. The features we'll be using are:

- `danceability`

- `energy`

- `loudness`

- `speechiness`

- `acousticness`

- `instrumentalness`

- `liveness`

- `tempo`

- and, of course, `valence`!

Now it's time to create the model. First, we set up by splitting the data:

```{r, warning = FALSE, message = FALSE}
# 1 Data splitting
set.seed(18) # it's a good number! 
bw_split = initial_split(bwjoined, prop = 0.7)

bw_train = training(bw_split)
bw_test = testing(bw_split)
```
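
Because the two artists have very different numbers of songs, a purely random split can leave the training and test sets with slightly different artist proportions. One common remedy is a stratified split, which samples within each class; `initial_split()` has a `strata` argument for this. The base R sketch below shows the idea on made-up data (the `toy` data frame and the `stratified_split()` helper are illustrative assumptions, not part of this analysis).

```r
set.seed(18)

# Made-up data: 80 songs by one artist and 20 by another.
toy <- data.frame(
  artist_name = factor(rep(c("Artist A", "Artist B"), times = c(80, 20))),
  valence     = runif(100)
)

# Hypothetical helper: sample `prop` of the row indices within each class.
stratified_split <- function(data, class_col, prop = 0.7) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  train_rows <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(prop * length(rows)))]
  }))
  list(train = data[train_rows, ], test = data[-train_rows, ])
}

parts <- stratified_split(toy, "artist_name")

# Both pieces keep the original 80/20 artist mix.
print(prop.table(table(parts$train$artist_name)))
print(prop.table(table(parts$test$artist_name)))
```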

Now, we fit a classification decision tree to the training data...

```{r, warning = FALSE, message = FALSE}
# 2 Fit the model
bw_tree = 
  decision_tree() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)
```

...and generate predictions...

```{r, warning = FALSE, message = FALSE}
# 3 Generate predictions
bw_pred = bw_tree %>%
  predict(new_data = bw_test)
```

...and finally, generate prediction metrics. 

```{r, warning = FALSE, message = FALSE}
# 4 Prediction metrics
bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)
```

The accuracy of this model is **0.91**, which is *very* high. Because the accuracy is so high, it's worth looking at a **confusion matrix** to see how the predictions are laid out.

```{r, warning = FALSE, message = FALSE}
bw_test %>% 
  mutate(artist_pred = bw_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

According to this confusion matrix, David Bowie is predicted correctly most of the time (389 of his 403 test songs, about 97%), but Amy Winehouse has a lower accuracy (57 of her 86, about 66%). It's not extreme as she's still predicted correctly the majority of the time, but it's noticeable. This might be because David Bowie has many more songs than Amy Winehouse, so the model is more inclined to guess Bowie from the start. Here, about 82% of the test set is Bowie's music, so even a model that guessed David Bowie every single time would still have about 82% accuracy.
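
The per-artist figures can be recomputed directly from the confusion matrix counts. This is a base R sketch using the counts from the rendered output of the decision tree's confusion matrix; the variable names are my own, and `yardstick` offers functions such as `recall()` and `sens()` for the same purpose.

```r
# Counts copied from the decision tree's confusion matrix
# (rows = prediction, columns = truth). matrix() fills column by column.
cm <- matrix(
  c(57, 29,    # truth: Amy Winehouse
    14, 389),  # truth: David Bowie
  nrow = 2,
  dimnames = list(
    Prediction = c("Amy Winehouse", "David Bowie"),
    Truth      = c("Amy Winehouse", "David Bowie")
  )
)

accuracy <- sum(diag(cm)) / sum(cm)      # share of all songs predicted correctly
recall   <- diag(cm) / colSums(cm)       # share of each artist's songs predicted correctly
baseline <- max(colSums(cm)) / sum(cm)   # accuracy of always guessing the bigger class

print(round(c(accuracy = accuracy, baseline = baseline), 3))
print(round(recall, 3))
```

Comparing `accuracy` against `baseline` is a quick way to see how much of a high accuracy comes from class imbalance alone.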

Let's see if any of the other models can improve on this by learning from their mistakes (literally!).

## Bagged Decision Tree

A **bagged decision tree** is a model that builds many trees on different bootstrap samples of the training data, which hopefully lowers the model's variance and keeps bias low.
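
To make the idea concrete, here is a hand-rolled sketch of bagging on R's built-in `iris` data, using `rpart` trees and a majority vote. It is a simplified illustration under my own assumptions (25 trees, a 100/50 split), not how `baguette` implements `bag_tree()` internally.

```r
library(rpart)  # recommended package that ships with R

set.seed(18)
train_rows <- sample(nrow(iris), 100)
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

n_trees <- 25

# Fit one tree per bootstrap sample and collect its test-set predictions.
votes <- sapply(seq_len(n_trees), function(b) {
  boot_rows <- sample(nrow(train), replace = TRUE)
  tree <- rpart(Species ~ ., data = train[boot_rows, ], method = "class")
  as.character(predict(tree, newdata = test, type = "class"))
})

# Majority vote across the trees for each test row.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

print(mean(bagged_pred == test$Species))
```

Each tree sees a slightly different resample of the training rows, so averaging their votes smooths out the quirks of any single tree.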

```{r, warning = FALSE, message = FALSE}
# 1 Data splitting
set.seed(18) # it's a good number! 
bw_split = initial_split(bwjoined, prop = 0.7)

bw_train = training(bw_split)
bw_test = testing(bw_split)

# 2 Fit the model
bw_bag_tree = 
  bag_tree() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)

# 3 Generate predictions
bw_pred = bw_bag_tree %>%
  predict(new_data = bw_test)

# 4 Prediction metrics
bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)

bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

In this bagged decision tree, the accuracy is now 0.93, but not much has changed. If anything, accuracy for David Bowie is higher (398 of 403), but Amy Winehouse's barely moves (58 of 86).

Let's try a **random forest** next. A random forest is similar to a bagged decision tree, but it also considers only a *random subset of features* at each split in the tree.
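
The feature-subsetting step can be sketched in a few lines. This only illustrates how candidate features might be drawn for three splits; the square-root rule is a common default for classification forests, and I am assuming it here rather than reading it from the fitted model.

```r
set.seed(18)

features <- c("danceability", "energy", "loudness", "speechiness",
              "acousticness", "instrumentalness", "liveness", "tempo", "valence")

# A common default for classification is to try about sqrt(p) features per split.
mtry <- floor(sqrt(length(features)))

# Each split draws its own candidate set, so different splits see different features.
candidates_per_split <- replicate(3, sample(features, mtry), simplify = FALSE)
print(candidates_per_split)
```

Restricting each split to a few candidates makes the trees less alike, which is what lets their average improve on plain bagging.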

## Random Forest

```{r, warning = FALSE, message = FALSE}
# 1 Data splitting
set.seed(18) # it's a good number! 
bw_split = initial_split(bwjoined, prop = 0.7)

bw_train = training(bw_split)
bw_test = testing(bw_split)

# 2 Fit the model
bw_rf = 
  rand_forest() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)

# 3 Generate predictions
bw_pred = bw_rf %>%
  predict(new_data = bw_test)

# 4 Prediction metrics
bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)

bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

The random forest's results are very similar to the bagged tree's: the accuracy is again 0.93, with little change for either artist.

## Gradient Boosting Model (GBM)

**Gradient Boosting Models** (GBMs) have become very popular. They begin with a very simple tree, and each new tree learns from the mistakes of the trees before it in the ensemble. They have options that we can tweak, but we'll keep it simple.
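
The "learning from mistakes" loop can be sketched by hand. The example below is a deliberately simplified version for a regression problem with squared error, on made-up data, using one-split `rpart` trees; the learning rate and number of trees are arbitrary choices of mine, and the classification GBM fitted in this post is more involved.

```r
library(rpart)  # recommended package that ships with R

set.seed(18)
toy <- data.frame(x = seq(0, 2 * pi, length.out = 200))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.1)

learn_rate <- 0.1
n_trees    <- 50

# Start from the simplest possible model: predict the mean everywhere.
pred <- rep(mean(toy$y), nrow(toy))
mse  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # The residuals are the mistakes of the ensemble built so far.
  toy$resid <- toy$y - pred
  # Fit a tree with at most one split (a "stump") to those mistakes...
  stump <- rpart(resid ~ x, data = toy, control = rpart.control(maxdepth = 1))
  # ...and add a damped version of its correction to the running prediction.
  pred <- pred + learn_rate * predict(stump, newdata = toy)
  mse[m] <- mean((toy$y - pred)^2)
}

# Training error after 1, 10, and 50 rounds.
print(round(mse[c(1, 10, 50)], 3))
```

The training error shrinks round by round because every new stump targets whatever the current ensemble still gets wrong.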

```{r, message = FALSE, warning = FALSE}
# 1 Data splitting
set.seed(18) # it's a good number! 
bw_split = initial_split(bwjoined, prop = 0.7)

bw_train = training(bw_split)
bw_test = testing(bw_split)

# 2 Fit the model
bw_boost = 
  boost_tree() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bw_train)

# 3 Generate predictions
bw_pred = bw_boost %>%
  predict(new_data = bw_test)

# 4 Prediction metrics
bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)

bw_test %>%
  mutate(artist_pred = bw_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

This model has the highest accuracy at **0.94**. It is fantastic at predicting David Bowie's music (401 of 403), but not much has changed for Amy Winehouse's music (57 of 86). Again, I attribute part of this to the difference in sample sizes, but her music could just be more difficult to predict.

## Bonus: Adding More Artists!

Just for fun, let's use a **GBM** again to predict which artist is which... but try it across *all five* of my top artists!

First, to set up the dataset:

```{r, message=FALSE, warning=FALSE}
# Get the data from my other three top artists
gorillaz <- get_artist_audio_features('Gorillaz')
grande <- get_artist_audio_features('Ariana Grande')
theads <- get_artist_audio_features('Talking Heads')

# Combine all of the artists' data
all_artists = rbind(winehouse, bowie, gorillaz, grande, theads)

# Factor artist_name to use it for our GBM!
all_artists = all_artists %>% 
  mutate(artist_name = factor(artist_name))
```

Now that the data is all set, we can fit our GBM!

```{r, message=FALSE, warning=FALSE}
# 1 Data splitting
set.seed(18) # it's a good number! 
all_split = initial_split(all_artists, prop = 0.7)

all_train = training(all_split)
all_test = testing(all_split)

# 2 Fit the model
all_boost = 
  boost_tree() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = all_train)

# 3 Generate predictions
all_pred = all_boost %>%
  predict(new_data = all_test)

# 4 Prediction metrics
all_test %>%
  mutate(artist_pred = all_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)

all_test %>%
  mutate(artist_pred = all_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

The accuracy is **0.84**. That's not all that bad, actually! In this case, since we have so many more targets (artists), the confusion matrix suits its name: it's pretty confusing. Still, I can see that Ariana Grande and David Bowie have the highest accuracy. For *one more test*, I wonder how a model would do if it just had to look at those two musicians?
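
A five-by-five table is easier to read once it is reduced to one number per artist. This base R sketch re-enters the counts from the rendered confusion matrix and computes per-artist recall and precision; the variable names are my own.

```r
artists <- c("Amy Winehouse", "Ariana Grande", "David Bowie",
             "Gorillaz", "Talking Heads")

# Counts copied from the five-artist confusion matrix, one row per
# predicted artist (rows = prediction, columns = truth).
cm5 <- matrix(
  c(56,   2,   5,  3,  0,
     6, 126,   7,  1,  2,
    18,   1, 372, 23, 27,
     3,   2,   7, 55,  1,
     3,   0,  11,  9, 72),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = artists, Truth = artists)
)

recall    <- diag(cm5) / colSums(cm5)   # of each artist's songs, share found
precision <- diag(cm5) / rowSums(cm5)   # of each artist's predictions, share right

print(round(sort(recall, decreasing = TRUE), 3))
print(round(precision, 3))
```

Read this way, Ariana Grande (about 0.96) and David Bowie (about 0.93) have the highest recall, while Gorillaz has the lowest (about 0.60).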

## Bonus II: I Can't Stop Modeling!

Let's just do this all in one go:

```{r, message=FALSE, warning=FALSE}
# Combine Bowie's and Grande's data
bowie_grande = rbind(bowie, grande)

# Factor artist_name to use it for our GBM!
bowie_grande = bowie_grande %>% 
  mutate(artist_name = factor(artist_name))

# 1 Data splitting
set.seed(18) # it's a good number! 
bg_split = initial_split(bowie_grande, prop = 0.7)

bg_train = training(bg_split)
bg_test = testing(bg_split)

# 2 Fit the model
bg_boost = 
  boost_tree() %>%
  set_mode(mode = "classification") %>%
  fit(artist_name ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + tempo + valence, data = bg_train)

# 3 Generate predictions
bg_pred = bg_boost %>%
  predict(new_data = bg_test)

# 4 Prediction metrics
bg_test %>%
  mutate(artist_pred = bg_pred$.pred_class) %>%
  accuracy(estimate = artist_pred, truth = artist_name)

bg_test %>%
  mutate(artist_pred = bg_pred$.pred_class) %>%
  conf_mat(estimate = artist_pred, truth = artist_name)
```

**0.98 accuracy!!!** That's darn good, especially since the confusion matrix isn't sketchy about it! 

# Conclusion

Throughout this final project, a lot of cool stats were analyzed. I learned about Winehouse, Bowie, and how distinct some of the music that I listen to can be — David Bowie and Ariana Grande in particular.

I hope this data analysis was interesting and I encourage you to analyze your own Spotify data! Thank you for reading this post and I hope you have a great day!

![](https://chabechrod.files.wordpress.com/2015/07/tumblr_noo5n9xupt1r3ptbfo1_540.gif?w=540&zoom=2)
